Application/Control Number: 10/605,631
Art Unit: 2166 

DETAILED ACTION 

1. Receipt of Applicant's Amendment, filed 11/02/2007, is acknowledged. Claims 1,
11, and 20 have been amended. Claims 1-20 are pending in this office action.

Claim Rejections - 35 USC § 103 

2. The following is a quotation of 35 U.S.C. 103(a) which forms the basis for all 
obviousness rejections set forth in this Office action: 

(a) A patent may not be obtained though the invention is not identically disclosed or described as set 
forth in section 102 of this title, if the differences between the subject matter sought to be patented and
the prior art are such that the subject matter as a whole would have been obvious at the time the 
invention was made to a person having ordinary skill in the art to which said subject matter pertains. 
Patentability shall not be negatived by the manner in which the invention was made. 

This application currently names joint inventors. In considering patentability of 
the claims under 35 U.S.C. 103(a), the examiner presumes that the subject matter of 
the various claims was commonly owned at the time any inventions covered therein 
were made absent any evidence to the contrary. Applicant is advised of the obligation 
under 37 CFR 1.56 to point out the inventor and invention dates of each claim that was
not commonly owned at the time a later invention was made in order for the examiner to 
consider the applicability of 35 U.S.C. 103(c) and potential 35 U.S.C. 102(e), (f) or (g) 
prior art under 35 U.S.C. 103(a).

Claims 1-2, 4-7, 10-12, 14-17, and 20 are rejected under 35 U.S.C. 103(a) as
being unpatentable over Steven J. Simske (Simske hereinafter) (U.S. PG Pub No.
2004/0133560) in view of Taher et al. (Taher hereinafter) (NPL "Evaluating Strategies
for Similarity Search on the Web," ACM, May 7-11, 2002, pp. 1-23), further in view of
Henkin et al. (Henkin hereinafter) (U.S. PG Pub No. 2002/0107735).



With respect to claims 1 and 11, Simske teaches a method for computing a
measure of similarity between a first (or input) document and one or more disparate
(or search results) documents, comprising:

"(c) receiving a first list of rated keywords extracted from the first 
document and a list of rated keywords extracted from each of the one or more 
disparate documents" as organizing electronic documents may include generating a 
list of weighted keywords for each document (Simske Abstract, Fig. 4 and Paragraph 
0056). Paragraph 0056 teaches that many documents are compared using
shared/rated keyword lists.

"wherein keywords are rated at least in part by a relevant weight from their 
associated document language" as word weight may be computed (step 107), among 
other methods, by counting the number of times that word (including pronouns of that 
word) occurs in the document to produce a word count. By multiplying the word count 
by a "mean role weight" and a square root of the word's lemma length, which are used 
to estimate the word's importance, a total word weight is calculated. The "mean role
weight" is determined by summing the average grammatical role weight, noun role 
weight, and layout role weight of a word. In the exemplary embodiment, the overall 
weight of each keyword is calculated (step 107) as shown in the following equation: 
Weight = (GRoleWeight_i × NRoleWeight_i × LayoutWeight_i) × sqrt(length) (1)
(Simske Paragraph 0028). This reference finds weights with regard to the associated
document's language because it finds the weight by counting the number of times the
word appears in that document.

"(d) comparing the first list of rated keywords and the list of rated 
keywords from each of the one or more disparate documents to determine 
whether the first document forms part of the one or more disparate documents 
using a first computed percentage indicating what percentage of keyword ratings 
in the first list also exist in the list of at least one or more disparate documents" 
as the clustering process begins when the weighted keyword lists of two or more 
documents are compared (step 601). The host device calculates a value, called 
"shared word weight," that correlates the two documents. The shared word weight 
value indicates the extent to which two or more documents are related based on their 
keywords. A higher shared word weight indicates that the documents are more likely to 
be related (Simske Paragraph 0048). 

"(e) verifying inclusion of the first document in the one or more disparate 
documents by computing a second percentage for each of the one or more 
disparate documents indicating what percentage of keyword ratings along with a 
set of their neighboring keyword ratings in the first list also exist in the list for at 
least one of the one or more disparate documents when the first computed 
percentage indicates that the first document is included in at least one of the one 
or more disparate documents" as another possible way of weighting the relevancy 
metrics is to multiply the mean shared weight of extended words shared by two selected
text units, e.g., sentences, by the frequency metric of the shared extended words, i.e.,
the mean ratio of the extended word occurrences in the two documents compared to 
their occurrences in the larger corpus (Simske Paragraph 0064). 

"(f) using the first computed percentage to specify the measure of 
similarity when the computed second percentage for at least one of the one or 
more disparate documents is greater than the first computed percentage" as 
clustering documents with common titles, using weighted keywords to determine 
similarities between documents, etc., a preferred method uses a threshold shared word 
weight and a maximum, mean, or minimum shared word weight as explained above 
(Simske Paragraph 0055). 

Simske teaches the elements of claim 1 as noted above but does not explicitly 
teach, "receiving a first document and identifying the best keywords in the text by 
recognizing rare and uncommon keywords, including keywords that belong to 
one or more domain specific or subject matter dictionary and identifying 
documents similar to the first document using a query by formulating wrappers 
using the list of the best keywords identified in the first document that also 
appear in a DS dictionary," "ranking the one or more disparate documents 
indicating keyword ratings along with a set of their neighboring keyword ratings 
in the first list also exist in the list for at least one of the one or more disparate
documents," and "percentage of keywords and neighboring keywords." 

However, Taher teaches "receiving a first document and identifying the best 
keywords in the text by recognizing rare and uncommon keywords, including
keywords that belong to one or more domain specific or subject matter
dictionary" as monotonic term weighting schemes, however, amplify the weight of 
terms with very low document frequency. This amplification is in fact good for ad-hoc 
queries, where a rare term in the query should be given the most importance. In the
case where we are judging document similarities, rare terms are much less useful as 
they are often typos, rare names, or other nontopical terms that adversely affect the 
similarity measure. Therefore, we also experimented with nonmonotonic term-weighting 
schemes that attenuate both high and low document-frequency terms (Taher Page 9, 
3.3 Term Weighting). 

"identifying documents similar to the first document using a query by 
formulating wrappers using the list of the best keywords identified in the first 
document that also appear in a DS dictionary" as evaluation methodology has led us 
to the use of strategies that reflect the notion of "similarity" embodied in the popular 
ODP directory. For illustration, we have provided some sample queries in figure 13. In 
figure 14 we have given the top 10 words (by weight) in the bags for these query urls 
(Taher Page 18, 7.2 Quality of retrieved documents). Figure 14 shows the 10 best 
keywords for a query in a bag. 

"ranking the one or more disparate documents indicating keyword ratings 
along with a set of their neighboring keyword ratings in the first list also exist in the
list for at least one of the one or more disparate documents" as the goal of Web- 
page similarity search is to allow users to find Web pages similar to a query page [12]. In 
particular, given a query document, a similarity-search algorithm should provide a
ranked listing of documents similar to that document (Taher Page 1, 1.0 Introduction).
Taher also discloses choosing some fixed window size W, and always including W
words to the left, and W words to the right, of the anchor text. Specifically, we use
W ∈ {0, 4, 8, 16, 32}. We use sentence, paragraph, and HTML-region-detection
techniques to dynamically bound the region around the anchor text that gets included in
B_u. The primary document features that are capable of triggering a window cut-off are
paragraph boundaries, table cell boundaries, list item boundaries, and hard breaks which
follow sentence boundaries. This technique resulted in very narrow windows that averaged
close to only 3 words in either direction (Taher Page 8, 3.1 Choosing Terms).
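
For illustration only, the fixed-window idea quoted above (W words taken on each side of a chosen span of text) can be sketched as follows. The function name context_window, the whitespace tokenization, and the sample sentence are assumptions made for this sketch and do not come from Taher; the boundary-based cut-offs (paragraph, table cell, list item) are not modeled.

```python
def context_window(tokens, start, end, w):
    """Return up to w tokens to the left and up to w tokens to the right
    of the span tokens[start:end]. Hypothetical helper, not Taher's code."""
    left = tokens[max(0, start - w):start]   # clipped at the start of the text
    right = tokens[end:end + w]              # clipped at the end of the text
    return left, right


# Hypothetical sample text; the chosen span is tokens[6:8] == ["money", "page"].
tokens = "read the latest news on the money page for stock quotes today".split()

# The window sizes are the ones quoted from Taher: W in {0, 4, 8, 16, 32}.
windows = {w: context_window(tokens, 6, 8, w) for w in (0, 4, 8, 16, 32)}
```

With W = 0 the window is empty on both sides; larger values of W simply stop at the ends of the text.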

Further, Taher teaches, "wherein keywords are rated at least in part by a 
relevant weight from their associated document language" as (Taher Page 8-9, 3.3 
Term Weighting). 

It would have been obvious to one of ordinary skill in the art at the time the 
invention was made to combine the teaching of the cited references because Taher's 
teachings would have allowed Simske to reduce costs in both time and
resources and to provide efficient, quality results (Taher Page 17-18).

Simske and Taher teach the elements of claim 1 as noted above but do not
explicitly teach "percentage of keywords and neighboring keywords."

However, Henkin discloses "percentage of keywords and neighboring 
keywords" as (Henkin Paragraph 0222 & 0288).

It would have been obvious to one of ordinary skill in the art at the time the
invention was made to combine the teaching of the cited references because Henkin's 
teachings would have allowed Simske and Taher to determine the minimum 
percentage of matched words to be found in the document context in order to conclude 
that a match exists. 

With respect to claim 2, Simske teaches "the method according to claim 1,
wherein the second percentage at (c) is computed by giving weight only to those 
keywords and their set of neighboring keywords in the first list that match in the 
second list and a threshold percentage of the keywords in their set of 
neighboring keywords" as shown in Table 5, the documents share two keywords, 
"Hockey" and "Skating." The shared word weight value of the keywords may be chosen 
in a variety of ways, e.g., maximum, mean, and minimum (Simske Paragraph 0050). 

Claim 12 is essentially the same as claim 2 except it sets forth the claimed 
invention as a system and is rejected for the same reasons as applied hereinabove. 

With respect to claim 4, Simske teaches "the method according to claim 2, 
wherein the threshold percentage is reduced when the first list of rated keywords 
is identified using OCR" as the documents included in each cluster may be adjusted 
by changing the threshold of the required shared word weight for clustering (Simske 
Paragraph 0058). If any documents being considered are paper-based, tools such as a
zoning analysis engine in combination with an optical character recognition (OCR)
engine may be used to convert the paper-based document to an electronic document 
(Simske Paragraph 0016). 

Claim 14 is essentially the same as claim 4 except it sets forth the claimed 
invention as a system and is rejected for the same reasons as applied hereinabove. 

With respect to claim 5, Simske does not explicitly teach "the method
according to claim 1, further comprising (e) if the first computed percentage does 
not indicate that the first document is included in the second document, 
computing a third percentage using the Jaccard distance measure." 

However, Taher discloses "(e) if the first computed percentage does not 
indicate that the first document is included in the second document, computing a 
third percentage using the Jaccard distance measure" as (Taher Page 9, 4 
Document Similarity Metric). 
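
The Office action cites Taher's similarity metric section without quoting a formula. As a hedged illustration, the standard set-based Jaccard measure can be sketched as below; this is the textbook definition, not necessarily the exact form used in Taher or recited in the claims, and the function names and keyword sets are hypothetical.

```python
def jaccard_similarity(a, b):
    """Standard Jaccard similarity: |A intersect B| / |A union B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumed convention: two empty sets count as identical
    return len(a & b) / len(a | b)


def jaccard_distance(a, b):
    """Jaccard distance, defined as 1 minus the Jaccard similarity."""
    return 1.0 - jaccard_similarity(a, b)


# Hypothetical keyword sets for two documents.
doc1 = {"hockey", "skating", "ice", "puck"}
doc2 = {"hockey", "skating", "rink"}

similarity = jaccard_similarity(doc1, doc2)  # 2 shared keywords out of 5 distinct
distance = jaccard_distance(doc1, doc2)
```

Expressed as a percentage, the similarity here is 40 percent: two keywords are shared out of five distinct keywords across both sets.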

It would have been obvious to one of ordinary skill in the art at the time the 
invention was made to combine the teaching of the cited references because Taher's 
teachings would have allowed Simske to reduce costs in both time and
resources and to provide efficient, quality results (Taher Page 17-18).

Claim 15 is essentially the same as claim 5 except it sets forth the claimed 
invention as a system and is rejected for the same reasons as applied hereinabove.

With respect to claim 6, Simske and Taher do not explicitly teach "the
method according to claim 5, further comprising (f) if the third computed 
percentage indicates that the first document is a revision of the second 
document, computing a fourth percentage indicating what percentage of keyword 
ratings along with a set of their neighboring keyword ratings in the second list 
also exist in the first list." 

However, Henkin discloses "the method according to claim 5, further 
comprising (f) if the third computed percentage indicates that the first document 
is a revision of the second document, computing a fourth percentage indicating 
what percentage of keyword ratings along with a set of their neighboring keyword 
ratings in the second list also exist in the first list" as (Henkin Paragraph 0229 & 
0288). 

It would have been obvious to one of ordinary skill in the art at the time the 
invention was made to combine the teaching of the cited references because Henkin's 
teachings would have allowed Simske and Taher to determine the minimum 
percentage of matched words to be found in the document context in order to conclude 
that a match exists. 

Claim 16 is essentially the same as claim 6 except it sets forth the claimed 
invention as a system and is rejected for the same reasons as applied hereinabove.

With respect to claim 7, Simske teaches "the method according to claim 6,
further comprising using the fourth computed percentage to specify the measure 
of similarity except when: (i) the fourth computed percentage is greater than the 
second computed percentage; (ii) the first list of rated keywords is identified 
using OCR; (iii) the fourth computed percentage is greater than fifty percent; and 
(iv) less than twenty percent of the keywords in the first list of keywords are in 
the second list of keywords" as if any documents being considered are paper-based, 
tools such as a zoning analysis engine in combination with an optical character 
recognition (OCR) engine may be used to convert the paper-based document to an 
electronic document (Simske Paragraph 0016). The keywords in the documents are 
being identified using OCR in the reference. Therefore, there is no need for using the fourth
computed percentage to specify the measure of similarity.

Claim 17 is essentially the same as claim 7 except it sets forth the claimed 
invention as a system and is rejected for the same reasons as applied hereinabove. 

With respect to claim 10, Simske teaches, "the method according to claim 1, 
wherein the first document is a portion of the second document" as a method and 
system for organizing electronic documents by generating a list of weighted keywords, 
clustering documents sharing one or more keywords, and linking documents within a 
cluster by using similar keywords, sentences, paragraphs, etc., as links. The 
embodiments provide customizable user control of keyword quantities, cluster
selectivity, and link specificity, i.e., links may connect similar paragraphs, sentences,
individual words, etc. (Simske Paragraph 0015).

With respect to claim 20, Simske teaches an article of manufacture for 
computing a measure of similarity between a first (or input) document and a 
second (or search results) document, the article of manufacture comprising 
computer usable media including computer readable instructions embedded 
therein that causes a computer to perform a method wherein the method 
comprises: 

"(c) receiving a first list of rated keywords extracted from the first 
document and a second list of rated keywords extracted from the second 
document" as organizing electronic documents may include generating a list of 
weighted keywords for each document (Simske Abstract, & Fig. 4). 

"(d) using the first and second lists of rated keywords to determine 
whether the first document forms part of the second document using a first 
computed percentage indicating what percentage of keyword ratings in the first 
list also exist in the second list" as the clustering process begins when the weighted 
keyword lists of two or more documents are compared (step 601). The host device 
calculates a value, called "shared word weight," that correlates the two documents. The 
shared word weight value indicates the extent to which two or more documents are 
related based on their keywords. A higher shared word weight indicates that the 
documents are more likely to be related (Simske Paragraph 0048).

"(e) verifying inclusion of the first document in the second document
computing a second percentage indicating what percentage of keyword ratings 
along with a set of their neighboring keyword ratings in the first list also exist in 
the second list when the first computed percentage indicates that the first 
document is included in the second document" as another possible way of 
weighting the relevancy metrics is to multiply the mean shared weight of extended 
words shared by two selected text units, e.g., sentences, by the frequency metric of the 
shared extended words, i.e., the mean ratio of the extended word occurrences in the 
two documents compared to their occurrences in the larger corpus (Simske Paragraph 
0064). 

"(f) using the first computed percentage to specify the measure of 
similarity when the second computed percentage is greater than the first 
computed percentage" as clustering documents with common titles, using weighted 
keywords to determine similarities between documents, etc., a preferred method uses a 
threshold shared word weight and a maximum, mean, or minimum shared word weight 
as explained above (Simske Paragraph 0055). 

"the fourth computed percentage to specify the measure of similarity 
except when: (i) the fourth computed percentage is greater than the second 
computed percentage; (ii) the first list of rated keywords is identified using OCR; 
(iii) the fourth computed percentage is greater than fifty percent; and (iv) less
than twenty percent of the keywords in the first list of keywords are in the second 
list of keywords" as if any documents being considered are paper-based, tools such
as a zoning analysis engine in combination with an optical character recognition (OCR)
engine may be used to convert the paper-based document to an electronic document
(Simske Paragraph 0016). The keywords in the documents are being identified using
OCR in the reference. Therefore, there is no need for using the fourth computed percentage
to specify the measure of similarity.

Simske teaches the elements of claim 20 as noted above but does not explicitly
teach "(a) receiving a first document and identifying the best keywords in the
text by recognizing rare and uncommon keywords, including keywords that 
belong to one or more domain specific or subject matter dictionary," "(b) 
identifying documents similar to the first document using a query by formulating 
wrappers using the list of the best keywords identified in the first document that 
also appear in a DS dictionary," "(g) if the first computed percentage does not
indicate that the first document is included in the second document, computing a 
third percentage using the Jaccard distance measure," "percentage of keywords 
and neighboring keywords" and "(h) if the third computed percentage indicates 
that the first document is a revision of the second document, computing a fourth 
percentage indicating what percentage of keyword ratings along with a set of 
their neighboring keyword ratings in the second list also exist in the first list." 

However, Taher discloses "(a) receiving a first document and identifying the 
best keywords in the text by recognizing rare and uncommon keywords, 
including keywords that belong to one or more domain specific or subject matter 
dictionary" as monotonic term weighting schemes, however, amplify the weight of
terms with very low document frequency. This amplification is in fact good for ad-hoc 
queries, where a rare term in the query should be given the most importance. In the
case where we are judging document similarities, rare terms are much less useful as 
they are often typos, rare names, or other nontopical terms that adversely affect the 
similarity measure. Therefore, we also experimented with nonmonotonic term-weighting 
schemes that attenuate both high and low document-frequency terms (Taher Page 9, 
3.3 Term Weighting). 

"(b) identifying documents similar to the first document using a query by 
formulating wrappers using the list of the best keywords identified in the first 
document that also appear in a DS dictionary" as evaluation methodology has led us 
to the use of strategies that reflect the notion of "similarity" embodied in the popular 
ODP directory. For illustration, we have provided some sample queries in figure 13. In 
figure 14 we have given the top 10 words (by weight) in the bags for these query urls 
(Taher Page 18, 7.2 Quality of retrieved documents). Figure 14 shows the 10 best 
keywords for a query in a bag. 

"(e) if the first computed percentage does not indicate that the first 
document is included in the second document, computing a third percentage 
using the Jaccard distance measure" as (Taher Page 9, 4 Document Similarity 
Metric). 

It would have been obvious to one of ordinary skill in the art at the time the 
invention was made to combine the teaching of the cited references because Taher's
teachings would have allowed Simske to reduce costs in both time and
resources and to provide efficient, quality results (Taher Page 17-18).

Simske and Taher teach the elements of claim 20 as noted above but do not
explicitly teach "percentage of keywords and neighboring keywords" and "(f) if
the third computed percentage indicates that the first document is a revision of 
the second document, computing a fourth percentage indicating what percentage 
of keyword ratings along with a set of their neighboring keyword ratings in the 
second list also exist in the first list." 

However, Henkin discloses "percentage of keywords and neighboring 
keywords" as (Henkin Paragraph 0222 & 0288) and "(f) if the third computed 
percentage indicates that the first document is a revision of the second 
document, computing a fourth percentage indicating what percentage of keyword 
ratings along with a set of their neighboring keyword ratings in the second list 
also exist in the first list" as (Henkin Paragraph 0229 & 0288). 

It would have been obvious to one of ordinary skill in the art at the time the 
invention was made to combine the teaching of the cited references because Henkin's 
teachings would have allowed Simske and Taher to determine the minimum 
percentage of matched words to be found in the document context in order to conclude 
that a match exists. 

3. Claims 3 and 13 are rejected under 35 U.S.C. 103(a) as being unpatentable over
Steven J. Simske (U.S. PG Pub No. 2004/0133560) in view of Taher et al. (NPL
"Evaluating Strategies for Similarity Search on the Web," ACM, May 7-11, 2002, pp. 1-23),
further in view of Henkin et al. (U.S. PG Pub No. 2002/0107735) as applied to
claims 1-2, 4-7, 10-12, 14-17, and 20 above, further in view of Rie Kubota (Kubota
hereinafter) (U.S. Patent No. 6,041,323).

With respect to claim 3, Simske teaches "the method according to claim 2, 
wherein the second percentage at (c) is computed by giving full weight to those 
keywords in the first list of rated keywords that cannot be accurately identified as 
having a complete set of neighboring keywords in the second set of keywords"
as the experiment consists of varying the weighting, e.g., ranging the weight from 0.1 to
10.0 using 0.1 steps, for a particular attribute (Simske Paragraph 0032). Examiner 
considers 10 as being full weight. 

Simske teaches the elements of claim 3 as noted above but does not explicitly 
disclose "keywords that cannot be accurately identified as having a complete set 
of neighboring keywords in the second set of keywords." 

However, Kubota discloses "keywords that cannot be accurately identified 
as having a complete set of neighboring keywords in the second set of 
keywords" as the fixed length chain is searched from the character chain file. In step 
508, if it is determined that no fixed length chain is found, a message box is preferably 
displayed in step 526 for indicating that the search character string cannot be found, 
and the process ends (Kubota Col 26, Lines 44-48). Therefore the reference teaches 
that keywords are not found in the second set of keywords/document.

It would have been obvious to one of ordinary skill in the art at the time the
invention was made to combine the teaching of the cited references because Kubota's 
teachings would have allowed Simske and Henkin to provide a search method, which 
requires less storage capacity and extracts a unique character string at a high speed 
(Kubota Col 2, Lines 51-53) and to provide a method for searching for a comparison 
document, which has character strings similar to a partial input character string existing 
in an input document (Kubota Col 2, Lines 3-6). 

Claim 13 is essentially the same as claim 3 except it sets forth the claimed 
invention as a system and is rejected for the same reasons as applied hereinabove. 

4. Claims 9 and 19 are rejected under 35 U.S.C. 103(a) as being unpatentable over
Steven J. Simske (U.S. PG Pub No. 2004/0133560) in view of Taher et al. (NPL
"Evaluating Strategies for Similarity Search on the Web," ACM, May 7-11, 2002, pp. 1-23),
further in view of Henkin et al. (U.S. PG Pub No. 2002/0107735) as applied to
claims 1-2, 4-7, 10-12, 14-17, and 20 above, in view of Drissi et al. (Drissi hereinafter)
(U.S. PG Pub No. 2003/0149686).

With respect to claim 9, Simske does not explicitly teach "the method
according to claim 1, wherein the first list of rated keywords includes one or more 
keywords translated from a second language different from a first language that 
is identified as being a primary language of the first document."

However, Drissi discloses "the method according to claim 1, wherein the
first list of rated keywords includes one or more keywords translated from a 
second language different from a first language that is identified as being a 
primary language of the first document" as an inverted index 214 is created from the 
translated keywords. The translation of keywords is preferably accomplished using a 
keyword dictionary 220 which included words in English associated with the 
corresponding keywords in the national language (and vice versa) to form a synonym 
listing which effectively translates a keyword in one language into the corresponding 
term in another language and vice versa (Drissi Paragraph 0024). Examiner interprets
the national language as primary language. 

It would have been obvious to one of ordinary skill in the art at the time the 
invention was made to combine the teaching of the cited references because Drissi's 
teachings would have allowed Simske and Henkin to provide a translation process to
allow searching of the documents in different languages (Drissi Paragraph 0012). 

Claim 19 is essentially the same as claim 9 except it sets forth the claimed 
invention as a system and is rejected for the same reasons as applied hereinabove. 

5. Claims 8 and 18 are rejected under 35 U.S.C. 103(a) as being unpatentable over 
Steven J. Simske (U.S. PG Pub No. 2004/0133560).

With respect to claim 8, Simske teaches "the method according to claim 1,
wherein the first computed percentage indicates that the first document is 
included in the second document when the percentage defined by ratio of 
Sum1/Sum2 is greater than approximately ninety percent, where" as for example, if 
a threshold shared word weight value of 0.7 is designated, and the two documents of 
Table 5 are being compared for possible clustering, using the maximum shared word 
weight value (1.0) will cluster the two documents, while using the mean shared word
weight (0.5) or minimum shared word weight values (0.3) will not cluster the two
documents (Simske Paragraph 0052). Examiner interprets the threshold value of 70 
percent as 90 percent. 

"D1 is the number of keywords in first list of keywords" as table 5 with 
keywords from document 1 and document 2 (Simske Paragraph 0049). 

"D2 is the number of keywords in the second list of keywords" as table 5 
with keywords from document 1 and document 2 (Simske Paragraph 0049). 

"Sum1 is the sum of the weights of keywords that appear in D1 that also 
appear in D2" as the sum of all weight values for "Hockey" and "Skating" is 
0.4+0.25+0.3+0.05=1.0 (Simske Paragraph 0052). Hockey and Skating appear in both
D1 and D2. 

"Sum2 is the sum of the weights of keywords in D1" as the keywords are 
located, a sentence weight is calculated (502), for example, by adding together all the 
keyword weights (Simske Paragraph 0045).

Simske teaches the elements of claim 8 as noted above but does not explicitly
disclose "Sum1/Sum2."

However, Simske teaches "Sum1/Sum2" as the mean shared word weight 
value is [fraction (1.0/2)]=0.5 (Simske Paragraph 0052). 
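
For illustration only, the ratio recited in claim 8 can be sketched directly from the claim language quoted above: Sum1 is the sum of the weights of keywords in D1 that also appear in D2, Sum2 is the sum of the weights of all keywords in D1, and inclusion is indicated when Sum1/Sum2 exceeds approximately ninety percent. The function name, the keyword weights, and the exact 0.9 cut-off used here are hypothetical and are not taken from Simske or from the application.

```python
def inclusion_ratio(d1_weights, d2_keywords):
    """Sum1/Sum2 as worded in claim 8 (a sketch, not the applicant's code).

    d1_weights:  mapping of keyword -> weight for the first list (D1)
    d2_keywords: set of keywords in the second list (D2)
    """
    sum1 = sum(w for k, w in d1_weights.items() if k in d2_keywords)
    sum2 = sum(d1_weights.values())
    return sum1 / sum2 if sum2 else 0.0  # assumed: empty D1 yields 0.0


# Hypothetical weights, chosen so the arithmetic is exact.
d1 = {"hockey": 0.5, "skating": 0.25, "puck": 0.125, "rink": 0.125}
d2 = {"hockey", "skating", "puck", "goalie"}

ratio = inclusion_ratio(d1, d2)  # (0.5 + 0.25 + 0.125) / 1.0
included = ratio > 0.9           # claim 8 threshold: approximately ninety percent
```

In this hypothetical case the ratio is 87.5 percent, which falls short of the ninety percent threshold, so inclusion would not be indicated.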

It would have been obvious to one of ordinary skill in the art at the time the 
invention was made to combine the teachings of the cited reference to find the ratio for 
two possible similar documents by dividing the sum of keywords from both documents 
by sum of keywords in one document. 

Claim 18 is essentially the same as claim 8 except it sets forth the claimed 
invention as a system and is rejected for the same reasons as applied hereinabove. 

Response to Arguments 

6. Applicant's arguments filed 11/02/2007 have been fully considered but they are
not persuasive. 

In these arguments applicant relies on the amended claims and not the original
ones.

Applicant argues that Simske, Taher, and Henkin do not teach "receiving a 
first document and identifying the best keywords in the text by recognizing rare 
and uncommon keywords, including keywords that belong to one or more 
domain specific or subject matter dictionary and identifying documents similar to 
the first document using a query by formulating wrappers using the list of the 
best keywords identified in the first document that also appear in a DS 
dictionary." 

However, Taher teaches "receiving a first document and identifying the best 
keywords in the text by recognizing rare and uncommon keywords, including 
keywords that belong to one or more domain specific or subject matter 
dictionary" as monotonic term weighting schemes, however, amplify the weight of 
terms with very low document frequency. This amplification is in fact good for ad-hoc 
queries, where a rare term in the query should be given the most importance. In the 
case where we are judging document similarities, rare terms are much less useful as 
they are often typos, rare names, or other nontopical terms that adversely affect the 
similarity measure. Therefore, we also experimented with nonmonotonic term-weighting 
schemes that attenuate both high and low document-frequency terms (Taher Page 9, 
3.3 Term Weighting). 
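
For illustration, the contrast drawn in the quoted passage between monotonic and nonmonotonic term weighting can be sketched as follows. The bell-shaped function on the logarithm of document frequency, its parameters `peak_df` and `width`, and both function names are assumptions chosen for this sketch; they are not the formula used in the Taher reference.

```python
import math


def monotonic_weight(df, n_docs):
    """IDF-style weight: keeps growing as document frequency falls."""
    return math.log(n_docs / df)


def nonmonotonic_weight(df, peak_df=1000.0, width=1.5):
    """Hypothetical bell curve on log(df): attenuates both very rare and very common terms."""
    z = (math.log(df) - math.log(peak_df)) / width
    return math.exp(-0.5 * z * z)


# With the monotonic scheme a term seen once outweighs a mid-frequency term;
# with the nonmonotonic sketch the mid-frequency term receives the highest weight.
rare_vs_mid_monotonic = monotonic_weight(1, 10**6) > monotonic_weight(1000, 10**6)
rare_vs_mid_nonmonotonic = nonmonotonic_weight(1) < nonmonotonic_weight(1000)
```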

"identifying documents similar to the first document using a query by 
formulating wrappers using the list of the best keywords identified in the first 
document that also appear in a DS dictionary" as evaluation methodology has led us 
to the use of strategies that reflect the notion of "similarity" embodied in the popular 
ODP directory. For illustration, we have provided some sample queries in figure 13. In 
figure 14 we have given the top 10 words (by weight) in the bags for these query urls 
(Taher Page 18, 7.2 Quality of retrieved documents). Figure 14 shows the 10 best 
keywords for a query in a bag. 

URL                    Top Terms in Bag (Decreasing Order by Weight) 

moneycentral.msn.com   money, finance, msn, website, moneycentral, stock, employment, microsoft, business, investor 
www.weather.com        weather, channel, forecasts, fbc, enter, travel, seek, best, national, usa 
www.cnnfn.com          finance, business, cnn, cnnfn, stock, market, street, money, wall, journal 
www.mp3.com            music, audio, player, artist, napster, radio, band, million, century, sohr 
java.sun.com           java, jdk, technology, microsystems, api, applet, spacer, platform, language, website 
www.cdnow.com          music, cdnow, amazon, records, books, sports, best, entertainment, favorite, audio 

Figure 14: Top 10 words from sample bags 



Further applicant argues that Simske does not teach or suggest "using the first 
and second lists of rated keywords to determine whether the first document 
forms part of the second document using a first computed percentage indicating 
what percentage of keyword ratings in the first list also exist in the second list," 
"computing a second percentage indicating what percentage of keyword ratings 
along with a set of their neighboring keyword ratings in the first list also exist in 
the second list when the first computed percentage indicates that the first 
document is included in the second document." 

In response to applicant arguments examiner respectfully submits that Simske 
teaches "using the first and second lists of rated keywords to determine whether 
the first document forms part of the second document using a first computed 
percentage indicating what percentage of keyword ratings in the first list also 
exist in the second list" as the clustering process begins when the weighted keyword 
lists of two or more documents are compared (step 601). The host device calculates a 
value, called "shared word weight," that correlates the two documents. The shared 
word weight value indicates the extent to which two or more documents are related 
based on their keywords. A higher shared word weight indicates that the documents 
are more likely to be related (Simske Paragraph 0048). 
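
As a non-authoritative sketch, the argued limitation of a "first computed percentage indicating what percentage of keyword ratings in the first list also exist in the second list" can be read as a simple overlap computation. The function name, the use of plain keyword strings in place of keyword ratings, and the sample lists are assumptions made for illustration; neither the claims nor Simske specify this code.

```python
def percent_in_second(first, second):
    """Percentage of entries in `first` that also occur in `second`."""
    if not first:
        return 0.0
    second_set = set(second)
    hits = sum(1 for item in first if item in second_set)
    return 100.0 * hits / len(first)


# Hypothetical keyword lists for two documents.
first_list = ["hockey", "skating", "stadium", "ice"]
second_list = ["hockey", "ice", "weather"]

first_percentage = percent_in_second(first_list, second_list)
```

The measure is asymmetric: it is taken relative to the first list, which is consistent with the claim language about whether the first document forms part of the second.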

"computing a second percentage indicating what percentage of keyword 
ratings along with a set of their neighboring keyword ratings in the first list also 
exist in the second list when the first computed percentage indicates that the first 
document is included in the second document" as another possible way of 
weighting the relevancy metrics is to multiply the mean shared weight of extended 
words shared by two selected text units, e.g., sentences, by the frequency metric of the 
shared extended words, i.e., the mean ratio of the extended word occurrences in the 
two documents compared to their occurrences in the larger corpus (Simske Paragraph 
0064). 

Simske teaches the elements of argued limitation as noted above but does not 
explicitly teach, "percentage of keywords and neighboring keywords." 

However, Henkin discloses "percentage of keywords and neighboring 
keywords" as (Henkin Paragraph 0222 & 0288). 

It would have been obvious to one of ordinary skill in the art at the time the 
invention was made to combine the teaching of the cited references because Henkin's 
teachings would have allowed Simske to determine the minimum percentage of 
matched words to be found in the document context in order to conclude that a match 
exists. 

Further, regarding claim 20, applicant argues that Simske does not teach "a 
fourth computed percentage to specify the measure of similarity except when: (i) 
the fourth computed percentage is greater than the second computed 
percentage; (ii) the first list of rated keywords is identified using OCR; (iii) the 
fourth computed percentage is greater than fifty percent; and (iv) less than twenty 
percent of the keywords in the first list of keywords are in the second list of 
keywords." 

In response to the preceding arguments examiner respectfully submits that 
Simske teaches "the fourth computed percentage to specify the measure of 
similarity except when: (i) the fourth computed percentage is greater than the 
second computed percentage; (ii) the first list of rated keywords is identified 
using OCR; (iii) the fourth computed percentage is greater than fifty percent; and 
(iv) less than twenty percent of the keywords in the first list of keywords are in 
the second list of keywords" as if any documents being considered are paper-based, 
tools such as a zoning analysis engine in combination with an optical character 
recognition (OCR) engine may be used to convert the paper-based document to an 
electronic document (Simske Paragraph 0016). The keywords in the documents are 
being identified using OCR in the reference. Therefore, there is no need for using the fourth 
computed percentage to specify the measure of similarity. 

Conclusion 



7. THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time 
policy as set forth in 37 CFR 1.136(a). 

A shortened statutory period for reply to this final action is set to expire THREE 
MONTHS from the mailing date of this action. In the event a first reply is filed within 
TWO MONTHS of the mailing date of this final action and the advisory action is not 
mailed until after the end of the THREE-MONTH shortened statutory period, then the 
shortened statutory period will expire on the date the advisory action is mailed, and any 
extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of 
the advisory action. In no event, however, will the statutory period for reply expire later 
than SIX MONTHS from the mailing date of this final action. 

Contact Information 

8. Any inquiry concerning this communication or earlier communications from the 
examiner should be directed to Usmaan Saeed whose telephone number is (571)272-4046. 
The examiner can normally be reached on M-F 8-5. 

If attempts to reach the examiner by telephone are unsuccessful, the examiner's 
supervisor, Hosain Alam, can be reached on (571)272-3978. The fax phone number for the 
organization where this application or proceeding is assigned is 571-273-8300. 

Information regarding the status of an application may be obtained from the Patent 
Application Information Retrieval (PAIR) system. Status information for published applications 
may be obtained from either Private PAIR or Public PAIR. Status information for unpublished 
applications is available through Private PAIR only. For more information about the PAIR 
system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private 
PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). 